Video Search


A Challenge to Build Neuro-Symbolic Video Agents

Shah, Sahil, Goel, Harsh, Narasimhan, Sai Shankar, Choi, Minkyu, Sharan, S P, Akcin, Oguzhan, Chinchali, Sandeep

arXiv.org Artificial Intelligence

Modern video understanding systems excel at tasks such as scene classification, object detection, and short video retrieval. However, as video analysis becomes increasingly central to real-world applications, there is a growing need for proactive video agents: systems that not only interpret video streams but also reason about events and take informed actions. A key obstacle in this direction is temporal reasoning: while deep learning models have made remarkable progress in recognizing patterns within individual frames or short clips, they struggle to understand the sequencing and dependencies of events over time, which is critical for action-driven decision-making. Addressing this limitation demands moving beyond conventional deep learning approaches. We posit that tackling this challenge requires a neuro-symbolic perspective, where video queries are decomposed into atomic events, structured into coherent sequences, and validated against temporal constraints. Such an approach can enhance interpretability, enable structured reasoning, and provide stronger guarantees on system behavior, all key properties for advancing trustworthy video agents. To this end, we present a grand challenge to the research community: developing the next generation of intelligent video agents that integrate three core capabilities: (1) autonomous video search and analysis, (2) seamless real-world interaction, and (3) advanced content generation. By addressing these pillars, we can transition from passive perception to intelligent video agents that reason, predict, and act, pushing the boundaries of video understanding.
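
The decomposition the abstract describes lends itself to a compact illustration. Below is a minimal Python sketch, not taken from the paper: atomic events (as a neural perception stack might emit them) are validated against "before" temporal constraints that a query parser would derive from natural language. All class and function names are illustrative assumptions.

```python
# A minimal sketch of the neuro-symbolic pipeline described above: a video
# query is decomposed into atomic events, and a detected event sequence is
# validated against temporal ("before") constraints. Names are illustrative,
# not from the paper.
from dataclasses import dataclass


@dataclass
class Event:
    label: str       # atomic event, e.g. "person_enters"
    t_start: float   # seconds from start of stream
    t_end: float


def satisfies_before(events: list[Event], first: str, second: str) -> bool:
    """Check the temporal constraint: some `first` event ends before
    some `second` event starts."""
    firsts = [e for e in events if e.label == first]
    seconds = [e for e in events if e.label == second]
    return any(a.t_end < b.t_start for a in firsts for b in seconds)


def validate_query(events: list[Event], constraints: list[tuple[str, str]]) -> bool:
    """A query here is a conjunction of pairwise 'A before B' constraints.
    In a full system, neural perception would produce `events` and a
    language parser would produce `constraints`."""
    return all(satisfies_before(events, a, b) for a, b in constraints)


# Example: "a person enters, then a bag is left unattended"
detections = [
    Event("person_enters", 3.0, 4.5),
    Event("bag_left", 10.2, 11.0),
]
print(validate_query(detections, [("person_enters", "bag_left")]))  # True
```

A symbolic layer of this kind makes the check auditable: a failed query can report exactly which constraint was violated, which is the sort of interpretability and behavioral guarantee the abstract argues for.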


AnyClip snaps up $47M for its video search and analytics technology – TechCrunch

#artificialintelligence

Video is, quite literally, what gets the world moving online these days, expected to account for 82% of all IP traffic this year. Today a startup that has built a set of tools to help better parse, index and ultimately discover that trove of content is announcing a big round of funding to expand its business after seeing 600% growth in the last year. AnyClip -- which combines artificial intelligence with more standard search tools to provide better video analytics for content providers to improve how those videos can be used and viewed -- has raised $47 million, money that it will be using to build out its platform and where it can be applied. The funding is being led by JVP, with La Maison, Bank Mizrahi and internal investors also participating. The company is not officially disclosing its valuation but has raised $70 million to date and I understand from reliable sources that it is around $300 million.


SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries

Li, Xirong, Zhou, Fangming, Xu, Chaoxi, Ji, Jiaqi, Yang, Gang

arXiv.org Artificial Intelligence

Retrieving unlabeled videos by textual queries, known as Ad-hoc Video Search (AVS), is a core theme in multimedia data management and retrieval. The success of AVS counts on cross-modal representation learning that encodes both query sentences and videos into common spaces for semantic similarity computation. Inspired by the initial success of a few prior works in combining multiple sentence encoders, this paper takes a step forward by developing a new and general method for effectively exploiting diverse sentence encoders. The novelty of the proposed method, which we term Sentence Encoder Assembly (SEA), is two-fold. First, different from prior art that uses only a single common space, SEA supports text-video matching in multiple encoder-specific common spaces. Such a property prevents the matching from being dominated by a specific encoder that produces an encoding vector much longer than those of the other encoders. Second, in order to explore complementarities among the individual common spaces, we propose multi-space multi-loss learning. As extensive experiments on four benchmarks (MSR-VTT, TRECVID AVS 2016-2019, TGIF and MSVD) show, SEA surpasses the state-of-the-art. In addition, SEA is extremely easy to implement. All this makes SEA an appealing solution for AVS and promising for continuously advancing the task by harvesting new sentence encoders.
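
To make the multi-space idea concrete, here is a rough PyTorch sketch of encoder-specific common spaces with per-space similarities summed at match time. The dimensions, module names, and the choice of two sentence encoders are assumptions for illustration; SEA's actual architecture and its multi-space multi-loss training differ in detail.

```python
# A rough sketch of SEA's multi-space matching idea: each sentence encoder
# gets its own common space, the video is projected into every space, and
# per-space cosine similarities are summed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiSpaceMatcher(nn.Module):
    def __init__(self, video_dim=2048, enc_dims=(768, 300), space_dim=512):
        super().__init__()
        # One projection per sentence encoder (e.g., BERT-like, word2vec-like)
        self.text_proj = nn.ModuleList(nn.Linear(d, space_dim) for d in enc_dims)
        # The video is projected into each encoder-specific common space
        self.video_proj = nn.ModuleList(
            nn.Linear(video_dim, space_dim) for _ in enc_dims
        )

    def forward(self, video_feat, text_feats):
        """video_feat: (B, video_dim); text_feats: list of (B, enc_dim_i).
        Returns a (B, B) similarity matrix summed over the common spaces,
        which keeps any single high-dimensional encoder from dominating
        the match score."""
        sim = 0.0
        for tp, vp, t in zip(self.text_proj, self.video_proj, text_feats):
            tv = F.normalize(tp(t), dim=-1)
            vv = F.normalize(vp(video_feat), dim=-1)
            sim = sim + vv @ tv.T  # cosine similarity in this space
        return sim


matcher = MultiSpaceMatcher()
videos = torch.randn(4, 2048)
texts = [torch.randn(4, 768), torch.randn(4, 300)]
print(matcher(videos, texts).shape)  # torch.Size([4, 4])
```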


Is AI-powered video search becoming inevitable to security? - asmag.com

#artificialintelligence

Given the increasing affordability of equipment and growing awareness of security requirements, more and more cameras are being installed across the globe every day. While this is a good thing, the sheer volume of footage that comes in makes it difficult for operators to find specific objects or people when needed. This is one area where artificial intelligence (AI) is set to play a key role, and several security companies are already working on it, with a shared goal: make searching through videos as simple as using Google.


Video scanning technology is being transformed by machine learning

#artificialintelligence

With video content gaining popularity in giant strides each day, the demand to make that content searchable is also increasing. The overall task is simple to state: create machine-readable semantic metadata for videos that can then be analyzed with text mining techniques. But the task itself is very challenging. Not only does it require processing video content at scale, but the preferred approach of breaking videos down into still frames, aka images, has its own challenges. The biggest one is that processing 30 frames per second is a resource-intensive process, which demands a lookout for better approaches.
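
As a concrete example of working around the per-frame cost, the sketch below samples roughly one frame per second with OpenCV before any per-image analysis runs. This is one common mitigation, not the article's specific method; the file name and sampling rate are placeholders.

```python
# Sample frames at ~1 fps instead of processing all 30 fps, handing only
# the kept frames to the downstream image-analysis stage.
import cv2


def sample_frames(path: str, every_n_seconds: float = 1.0):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)  # keep this frame for semantic analysis
        idx += 1
    cap.release()
    return frames


frames = sample_frames("example.mp4")  # placeholder path
print(f"kept {len(frames)} frames instead of every frame")
```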


Video search by deep-learning

#artificialintelligence

[Slide-deck excerpt, partially recoverable: example concept detectors (aircraft, beach, mountain, people marching, police/security, flower); from university lab to spin-off to mobile phone (UvA / Euvision / Qualcomm); Snoek et al., TRECVID 2004-2015; the latest jump in mean average precision for video recognition is attributed to deep learning; closing question: given a video, can we find the best matching sentence?]


Sparse Transfer Learning for Interactive Video Search Reranking

Tian, Xinmei, Tao, Dacheng, Rui, Yong

arXiv.org Machine Learning

Visual reranking is effective in improving the performance of text-based video search. However, existing reranking algorithms can only achieve limited improvement because of the well-known semantic gap between low-level visual features and high-level semantic concepts. In this paper, we adopt interactive video search reranking to bridge the semantic gap by introducing the user's labeling effort. We propose a novel dimension reduction tool, termed sparse transfer learning (STL), to effectively and efficiently encode the user's labeling information. STL is particularly designed for interactive video search reranking. Technically, it a) considers the pair-wise discriminative information to maximally separate labeled query-relevant samples from labeled query-irrelevant ones, b) achieves a sparse representation for the subspace that encodes the user's intention by applying the elastic net penalty, and c) propagates the user's labeling information from labeled samples to unlabeled samples by using knowledge of the data distribution. We conducted extensive experiments on the TRECVID 2005, 2006 and 2007 benchmark datasets and compared STL with popular dimension reduction algorithms. We report superior performance using the proposed STL-based interactive video search reranking.
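
STL itself is more involved, but the elastic-net ingredient is easy to demonstrate. The sketch below is a simplified stand-in, not the paper's algorithm: it fits scikit-learn's ElasticNet to a handful of user relevance labels, yielding a sparse weight vector over visual features that reranks the unlabeled results. All data here is synthetic.

```python
# Simplified illustration of the elastic-net idea behind STL: the combined
# L1/L2 penalty produces a sparse weight vector (a 1-D "subspace") that
# encodes the user's intention and can rerank unlabeled videos.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(40, 100))       # visual features of labeled videos
y_labeled = rng.integers(0, 2, size=40).astype(float)  # 1 = relevant, 0 = not
X_unlabeled = rng.normal(size=(500, 100))    # the rest of the search results

# l1_ratio mixes the L1 (sparsity) and L2 parts of the elastic-net penalty
model = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X_labeled, y_labeled)
print("nonzero feature weights:", np.count_nonzero(model.coef_), "of 100")

# Rerank unlabeled videos by the learned sparse relevance score
scores = model.predict(X_unlabeled)
reranked = np.argsort(-scores)
print("top-5 reranked indices:", reranked[:5])
```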